
Reduce unnecessary file downloading in archive/file states #43518

Merged
merged 6 commits on Sep 22, 2017

Conversation

terminalmage
Contributor

@terminalmage terminalmage commented Sep 15, 2017

The file.managed state, which is used by the archive.extracted state to download the source archive, at some point recently was modified to clear the file from the minion cache. This caused unnecessary re-downloading on subsequent runs, which slows down states considerably when dealing with larger archives.

This PR makes the following changes to improve this:

  1. The fileclient now accepts a source_hash argument, which will cause the client's get_url function to skip downloading http(s) and ftp files if the file is already cached, and its hash matches the passed hash. This argument has also been added to the cp.get_url and cp.cache_file functions.

  2. We no longer try to download the file when it's an http(s) or ftp URL when running file.source_list.

  3. Where cp.cache_file is used, we pass the source_hash if it is available.

  4. A keep_source argument (originally named cache_source in this PR) has been added to the file.managed state, defaulting to True. This is now used to control whether or not the source file is cleared from the minion cache when the state completes.

  5. Two new states (file.cached and file.not_cached) have been added to manage files in the minion cache.

In addition, the archive.extracted state has been modified in the following ways:

  1. For consistency with file.managed, a keep_source argument (originally named cache_source) has been added. This also deprecates keep. If keep is used, keep_source assumes its value, and a warning is added to the state return to let the user know to update their SLS.

  2. The variable name cached_source (used internally in the archive.extracted state) has been renamed to cached to reduce confusion with the new keep_source argument. (Note: now that cache_source has been renamed to keep_source, I'm not going to go back and rename these variables back to cached_source. They will stay as-is.)

  3. The new file.cached and file.not_cached states are now used to manage the source tarball instead of file.managed. This improves disk usage and reduces unnecessary complexity in the state as we no longer keep a copy of the archive in a separate location within the cachedir. We now only use the copy downloaded using cp.cache_file within the file.cached state. This change has also necessitated a new home for hash files tracked by the source_hash_update argument, in a subdirectory of the minion cachedir called archive_hash.

Resolves #38971.
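Taken together, the new behavior can be sketched in a hypothetical SLS (the URL, hash, and target path below are placeholders for illustration, not values from this PR):

```yaml
# Hypothetical SLS illustrating the new keep_source argument.
# On subsequent runs, the cached archive is reused as long as its
# digest matches source_hash, so nothing is re-downloaded.
extract_app:
  archive.extracted:
    - name: /opt/app
    - source: https://example.com/app-1.0.tar.gz
    - source_hash: sha256=0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef
    - keep_source: True   # leave the archive in the minion cache after the run
```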

@damon-atkins
Contributor

damon-atkins commented Sep 15, 2017

I have had to use cp to cache a dir and then use cp.is_cached to get the path to a file. Is there a need for a cp.cache_local_path (_extrn_path), so one can get the file name translation from source to local cache?

With regards to caching http and https, can we use If-Modified-Since, or keep a copy of the header and compare key values in it to determine if the file needs to be downloaded again? Maybe a cache/.../extern_meta which contains the http/https headers for extrn_files.

Another thing which might help is keeping the timestamp on the cached file the same as the timestamp on the source, if possible.

I can raise these ideas as an issue if you like.
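The timestamp idea could be sketched like this (an illustration, not Salt code; the Last-Modified value is copied from the putty response shown later in this thread):

```python
import calendar
import os
import tempfile
from email.utils import parsedate

def apply_last_modified(path, last_modified_header):
    """Set a file's mtime/atime to match an HTTP Last-Modified header."""
    # parsedate() returns a UTC time tuple; timegm() converts it to an epoch
    timestamp = calendar.timegm(parsedate(last_modified_header))
    os.utime(path, (timestamp, timestamp))
    return timestamp

# Example with a scratch file and a sample header value
fd, cached = tempfile.mkstemp()
os.close(fd)
mtime = apply_last_modified(cached, 'Tue, 04 Jul 2017 19:36:06 GMT')
os.remove(cached)
```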

@terminalmage
Contributor Author

terminalmage commented Sep 15, 2017

> I have had to use cp to cache a dir and then use cp.is_cached to get the path to a file. Is there a need for cp.cache_local_path so one can get the file name translation from source to local cache?

cp.is_cached will give the cached path if the file is cached, and an empty string if it is not cached. Why is this not sufficient?

> With regards to caching http and https, it seems it needs to use If-Modified-Since, or keep a copy of the header and compare key values in it to determine if the file needs to be downloaded again. Maybe a cache/.../extern_meta which contains the http/https headers for extrn_files.

> Another thing which might help is keeping the timestamp on the cached file the same as the timestamp on the source, if possible.

I honestly have no idea what you're talking about here, sorry.

The source_hash will be used to drive whether or not the file is downloaded. If the locally-cached file matches the configured source_hash, we don't download. Before, we'd download every time, and then compare the hash to the source_hash. If it didn't match, we'd fail the state anyway, so there would be a lot of unnecessary downloading for what would result in a failed state. This approach allows the state to continue to work if the remote file changes, and if you want the new copy of the file, you just update the source_hash.
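The check described here — skip the download when the cached copy's digest matches source_hash — could be sketched like this (a simplified illustration, not the actual fileclient code):

```python
import hashlib
import os
import tempfile

def needs_download(cached_path, source_hash, hash_type='sha256'):
    """Return True if the file must be (re-)downloaded.

    The download is skipped only when a cached copy exists and its
    digest matches the configured source_hash.
    """
    if not os.path.isfile(cached_path):
        return True
    digest = hashlib.new(hash_type)
    with open(cached_path, 'rb') as fh:
        # Hash in chunks so large archives don't need to fit in memory
        for chunk in iter(lambda: fh.read(65536), b''):
            digest.update(chunk)
    return digest.hexdigest() != source_hash

# Example: a cached file whose sha256 we know
fd, path = tempfile.mkstemp()
os.write(fd, b'hello')
os.close(fd)
HELLO_SHA256 = '2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824'
fresh = needs_download(path, HELLO_SHA256)   # cached copy matches: no download
stale = needs_download(path, 'f' * 64)       # hash mismatch: download again
os.remove(path)
```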

@damon-atkins
Contributor

damon-atkins commented Sep 15, 2017

The following are two examples of http response headers.
From GET https://the.earth.li/~sgtatham/putty/0.70/w64/putty-64bit-0.70-installer.msi

HTTP/1.1 200 OK
Date: Fri, 15 Sep 2017 03:11:50 GMT
Server: Apache
Last-Modified: Tue, 04 Jul 2017 19:36:06 GMT
ETag: "2e8600-55382fea0ec86"
Accept-Ranges: bytes
Content-Length: 3048960
Keep-Alive: timeout=5, max=99
Connection: Keep-Alive
Content-Type: application/x-msi

From Inkscape:

HTTP/1.1 200 OK
Server: nginx/1.10.3
Content-Type: application/octet-stream
Last-Modified: Mon, 07 Aug 2017 16:11:49 GMT
ETag: "59889145-4f41ab7"
Expires: Tue, 10 Oct 2017 11:02:31 GMT
Cache-Control: max-age=2592000
Content-Disposition: attachment; filename="inkscape-0.92.2-x64.msi"
Content-Length: 83106487
Accept-Ranges: bytes
Date: Fri, 15 Sep 2017 03:04:31 GMT
Via: 1.1 varnish
Connection: keep-alive
X-Served-By: cache-mel6525-MEL
X-Cache: HIT
X-Cache-Hits: 0
X-Timer: S1505444671.241058,VS0,VE1

For example, if Content-Length, ETag and Last-Modified have not changed, the existing cached file is ok.
Cache-Control could be used to clean out old files.

https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/If-Modified-Since
In the putty example, when requesting the file with GET:
If-Modified-Since: Tue, 04 Jul 2017 19:36:06 GMT
The web server responds with 304 Not Modified.

Or using the ETag:
If-None-Match: "2e8600-55382fea0ec86"
The web server responds with 304 Not Modified.

Or send both.

Or use HEAD first, compare, and then GET (fetch) if needed, which might be easier to code.
curl -I https://the.earth.li/~sgtatham/putty/0.70/w64/putty-64bit-0.70-installer.msi

HTTP/1.1 200 OK
Date: Fri, 15 Sep 2017 03:22:01 GMT
Server: Apache
Last-Modified: Tue, 04 Jul 2017 19:36:06 GMT
ETag: "2e8600-55382fea0ec86"
Accept-Ranges: bytes
Content-Length: 3048960
Content-Type: application/x-msi

By keeping the header response from GET, the minion can determine whether the file in the cache is still valid.

cache/.../extern_meta/...../putty-64bit-0.70-installer.msi would contain the header from the original GET request:

HTTP/1.1 200 OK
Date: Fri, 15 Sep 2017 03:11:50 GMT
Server: Apache
Last-Modified: Tue, 04 Jul 2017 19:36:06 GMT
ETag: "2e8600-55382fea0ec86"
Accept-Ranges: bytes
Content-Length: 3048960
Keep-Alive: timeout=5, max=99
Connection: Keep-Alive
Content-Type: application/x-msi

If the HEAD response does not match the stored GET response headers in cache/.../extern_meta/...../putty-64bit-0.70-installer.msi when comparing Last-Modified, ETag, Content-Length, and Content-Type, then fetch the file again instead of using the cached version. The stored headers must contain Last-Modified or ETag (or both) to allow a valid comparison; a response missing Last-Modified, ETag, and/or Content-Length generally means the content is dynamic.
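The HEAD-and-compare approach could be sketched like this (an illustration of the proposal, not code from this PR; the header values are taken from the putty example above):

```python
# Decide whether a cached download is still valid by comparing the
# headers stored at download time against a fresh HEAD response.
VALIDATORS = ('Last-Modified', 'ETag', 'Content-Length', 'Content-Type')

def cache_is_valid(stored_headers, head_headers):
    """Return True if the cached file can be reused without re-downloading."""
    # Without Last-Modified or ETag there is nothing reliable to compare;
    # the content is likely dynamic, so always re-fetch.
    if 'Last-Modified' not in stored_headers and 'ETag' not in stored_headers:
        return False
    return all(
        stored_headers.get(key) == head_headers.get(key)
        for key in VALIDATORS
    )

# Headers saved when the file was first downloaded (putty example)
stored = {
    'Last-Modified': 'Tue, 04 Jul 2017 19:36:06 GMT',
    'ETag': '"2e8600-55382fea0ec86"',
    'Content-Length': '3048960',
    'Content-Type': 'application/x-msi',
}
unchanged = cache_is_valid(stored, dict(stored))               # reuse cache
changed = cache_is_valid(stored, {**stored, 'ETag': '"new"'})  # re-fetch
```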

@ari
Contributor

ari commented Sep 15, 2017

My feedback is that this change is unnecessarily complex and the documentation is very lacking.

  1. Why rename keep to cache_source? These sorts of arbitrary changes to suit the style of a new developer make upgrading salt more painful than necessary.

  2. The concept of file.cached and its interaction with other file states isn't clear to me. Can't we just restore the old behaviour of file states with regard to the keep attribute? That is, don't clear the file from the minion (if necessary by putting it or a hash into a new archive_hash folder -- that's poorly named, by the way, since it isn't specific to archives).

  3. If file.cached is really important and helpful, then it needs much more documentation about how to use it in conjunction with other file states.

@cachedout
Contributor

@terminalmage I'm wondering here about 2017.7 versus develop. Seems like the latter might be a better target but let me know your thoughts.

@saltstack saltstack deleted a comment from KumarRajnish Sep 19, 2017
@terminalmage
Contributor Author

> Why rename keep to cache_source? These sorts of arbitrary changes to suit the style of a new developer make upgrading salt more painful than necessary.

It's not an arbitrary change. The new caching behavior doesn't apply just to the archive.extracted state, it also applies to the file.managed state. An argument named keep doesn't make any sense in file.managed. It's just not descriptive enough, and I think I can safely make that declaration having been a contributor for almost six years, and having been employed full-time by SaltStack for four of those years.

I'm more than willing to have my fellow long-time contributors weigh in on this though, and use keep in both archive.extracted and file.managed. Or, perhaps, I can modify the code such that either keep or cache_source is a valid argument to archive.extracted, with both doing the same thing. @cachedout thoughts?

> The concept file.cached and its interaction with other file states isn't clear to me. Can't we just return the old behaviour of file states with regard to the keep attribute? That is, don't clear the file from the minion (if necessary by putting it or a hash into a new archive_hash folder -- that's poorly named by the way since it isn't specific to archives).

Using a file.managed state to download the archive was never a good idea in the first place. The reason is that it first downloads the file into the cachedir, and then places a copy of it in the final destination. Managing files within the cachedir using file.managed is just not a good idea; that's not what it was designed to do. So, to work around this, the archive.extracted state would create a separate path to use as the final destination for the cached archive file.

The new file.cached state simplifies what we need to do to ensure that the archive file is present in the cachedir by simply ensuring that copy has been cached using cp.cache_file, without the need to devise an alternate destination for the cached archive.

With regard to the archive_hash directory being "poorly named", I'm sorry but you are just incorrect. This is a brand new directory being added just for the archive.extracted state, and it is only used if the source_hash_update argument is set to True. source_hash_update is used to force the archive to be extracted if the file paths within the archive don't change but the hash of the source archive does. In the past, we would place this hash file alongside the separate path we had created for the copy of the archive we had downloaded using file.managed. Since we're not using file.managed anymore for this, we need a new home for these hash files, and placing them in the normal path under the cachedir could result in collisions if we do a cp.cache_file on a URL which would end up being cached to the same path. Thus, the new archive_hash directory which, yes, is specific to archives.

> If file.cached is really important and helpful, then it needs much more documentation about how to use it in conjunction with other file states.

Since it's designed to be used internally within state functions, and not "in conjunction" with them, I don't know how critical adding a bunch of documentation is. I can by all means add a couple of code examples, but this state will rarely be useful in an SLS file, and I have already mentioned in the docstring for file.cached that it is designed for internal use within states.
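A usage sketch of the kind of docstring example being described (the URL and hash are placeholders; per the discussion above, file.cached is intended primarily for internal use by other states):

```yaml
# Ensure an archive is present in the minion cache, verified by hash
cache_the_tarball:
  file.cached:
    - name: https://example.com/app-1.0.tar.gz
    - source_hash: sha256=0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef

# Later, evict it from the cache
evict_the_tarball:
  file.not_cached:
    - name: https://example.com/app-1.0.tar.gz
```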

@terminalmage
Contributor Author

I've added some code examples to the docstring for the file.cached state. I also noticed an issue with invoking a state function via __states__ instead of via a HighState instance (as we used to do), which I've also corrected.

@cachedout The unnecessary re-downloading started happening because of changes to file.managed in 2017.7.0, so I considered this a regression that needed fixing in the 2017.7 release branch. If you'd rather push this to the next feature release though, I can understand.

@cachedout
Contributor

@terminalmage I suppose it depends on our view of performance regressions. My view here is that changes to the file client are so fundamental to the behavior and stability of Salt that I like to take the more conservative route and defer them to major releases. I'm not necessarily opposed to merging it here but I'd feel more comfortable in the develop branch.

@terminalmage
Contributor Author

@cachedout I don't have an objection to rebasing against develop.

@ari
Contributor

ari commented Sep 20, 2017

@terminalmage

> An argument named keep doesn't make any sense in file.managed. It's just not descriptive enough, and I think I can safely make that declaration having been a contributor for almost six years, and having been employed full-time by SaltStack for four of those years.

Thank you for all your work on this project. It has been something very important to me for the last two years.

I come from the perspective of a user of salt, who struggled at first to learn and understand its concepts, and who lives in fear of every upgrade and what part of my configuration will be broken. I love the concepts in salt, but hate its fluid nature and lack of documentation, particularly examples of best practice.

With that out of the way, I get that you feel 'cache_source' is clearer, but to me it isn't. That's just one person, so make of it what you will. When I read it, I see "cache" as a noun, not a verb, so I expect the value to be a path, not a boolean. "keep" is always a verb. But more importantly, breaking my working states with each upgrade for the sake of a better name isn't very helpful to your users. "ls" may not be the most intuitive way to get a directory listing in Unix, but there is no discussion about changing it now.

Remember that every time you change some naming, you break not only everyone's working system (yes, everyone should read the release notes in detail and understand every implication, but they don't), but you also invalidate all the documentation, forum posts and other information already out there.

Perhaps this is a little like painting the bike shed, but for something as business-critical as a configuration management system, clarity and consistency are so, so important.

Thanks for listening, and thanks for your detailed explanation. I hope that you could put even a fraction of what you explained to me in the official documentation. Your description of archive_hash could go into the docs almost as is. Everyone would be much the wiser.

@terminalmage
Contributor Author

The keep usage still works, albeit with a warning. It doesn't "break everyone's working system".

> I hope that you could put even a fraction of what you explained to me in the official documentation. Your description of archive_hash could go into the docs almost as is. Everyone would be much the wiser.

But why? The choice of where we store the hash files tracked by source_hash_update (i.e. archive_hash, as changed in this PR) is not user-facing. It's entirely internal. What is the benefit of explaining something like this when users will almost certainly never need to interact with it? The docs are a user reference, not a developer reference.

@terminalmage
Contributor Author

I've modified the PR so that cache_source is now keep_source, and I've gotten rid of the warning when keep is used. If both keep_source and keep are used, keep is ignored.

@cachedout I've also rebased onto develop. I think I have a way of incorporating the non-fileclient fixes from this PR against 2017.7, which I can do in a separate PR.
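The precedence rule described here (keep ignored whenever keep_source is also passed) could be sketched like this (an illustration, not the actual state code; the function name and warning text are invented for the example):

```python
import warnings

def resolve_keep_source(keep_source=None, keep=None, default=True):
    """Resolve the effective keep_source value from new and legacy arguments.

    keep_source wins when both are passed; keep alone still works for
    backward compatibility.
    """
    if keep_source is not None:
        if keep is not None:
            warnings.warn("'keep' is ignored when 'keep_source' is also passed")
        return bool(keep_source)
    if keep is not None:
        return bool(keep)
    return default

effective = resolve_keep_source(keep_source=False, keep=True)  # keep is ignored
legacy = resolve_keep_source(keep=False)                       # old argument honored
default = resolve_keep_source()                                # defaults to True
```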

@terminalmage terminalmage removed the ZRELEASED - 2017.7.3 and bugfix-bckport (will be back-ported to an older release branch by creating a PR against that branch) labels Sep 21, 2017
@terminalmage terminalmage force-pushed the issue38971 branch 3 times, most recently from be9bdb6 to 9d749af Compare September 21, 2017 07:32
…e.extracted

The code used to have a salt.state.HighState instance call the state.
That method pre-dated availability of __states__ to use for executing
the state function. The HighState instance handles exception catching
and produces a list as output if there were errors which arose before
the state was executed. Running the state function using __states__ does
not give you any such protection. This commit removes the type check on
the return data, as it will never be a list when run via __states__, and
wraps the state function in a try/except to catch any exceptions that
may be raised by invoking the file.cached state.
This new argument name is ambiguous as "cache" can either be interpreted
as a noun or a verb.
@max-arnold
Contributor

Etag support has been implemented: #61764 #61391
